A Model-based K-means Algorithm for Name Disambiguation
Authors
Abstract
Unambiguous identities of resources are an important aspect of the semantic web. This paper addresses the personal identity issue in the context of bibliographies. Because names are abbreviated or misspelled in publications and bibliographies, an author may have multiple names and multiple authors may share the same name. Such name ambiguity degrades identity matching, document retrieval and database federation, and causes improper attribution of research credit. This paper describes a new K-means clustering algorithm, based on an extensible Naïve Bayes probability model, that disambiguates authors sharing the same first-name initial and last name in bibliographies and proposes a canonical name. The model captures three types of bibliographic information: coauthor names, the title of the paper and the title of the journal or proceedings. The algorithm achieves best accuracies of 70.1% and 73.6% when disambiguating 6 different "J Anderson"s and 9 different "J Smith"s, based on citations collected from researchers' publication web pages.
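To make the idea in the abstract concrete, here is a minimal sketch of a model-based K-means loop over the three information types it names. It is not the authors' implementation. The abstract does not specify the smoothing, the initialization or how the canonical name is chosen, so the Laplace smoothing (`alpha`), the caller-supplied initial assignment (`init`), the whitespace tokenization and all function names below are assumptions made for illustration.

```python
import math
from collections import Counter

# The three information types named in the abstract.
FIELDS = ("coauthors", "title", "venue")

def tokens(citation, field):
    """Assumed tokenization: one token per coauthor name, words otherwise."""
    value = citation[field]
    if field == "coauthors":
        return [name.lower() for name in value]
    return value.lower().split()

def fit_clusters(citations, assign, k):
    """Model update: per-cluster, per-field token counts and cluster sizes."""
    counts = [{f: Counter() for f in FIELDS} for _ in range(k)]
    sizes = [0] * k
    for cit, c in zip(citations, assign):
        sizes[c] += 1
        for f in FIELDS:
            counts[c][f].update(tokens(cit, f))
    return counts, sizes

def log_score(cit, c, counts, sizes, vocab, alpha):
    """Naive Bayes log-posterior (up to a constant) of cluster c for one citation."""
    score = math.log((sizes[c] + alpha) / (sum(sizes) + alpha * len(sizes)))
    for f in FIELDS:
        total = sum(counts[c][f].values())
        for t in tokens(cit, f):
            # Laplace smoothing is an assumption, not taken from the paper.
            score += math.log((counts[c][f][t] + alpha) / (total + alpha * len(vocab[f])))
    return score

def model_based_kmeans(citations, k, init, max_iter=50, alpha=1.0):
    """Alternate model fitting and hard reassignment until assignments stop changing."""
    vocab = {f: {t for cit in citations for t in tokens(cit, f)} for f in FIELDS}
    assign = list(init)
    for _ in range(max_iter):
        counts, sizes = fit_clusters(citations, assign, k)
        new = [max(range(k), key=lambda c: log_score(cit, c, counts, sizes, vocab, alpha))
               for cit in citations]
        if new == assign:
            break
        assign = new
    return assign

# Invented toy data: six citations that all list a "J Smith" as author.
toy = [
    {"coauthors": ["A Lee"], "title": "query optimization", "venue": "VLDB"},
    {"coauthors": ["A Lee"], "title": "query indexing", "venue": "VLDB"},
    {"coauthors": ["A Lee"], "title": "database query", "venue": "SIGMOD"},
    {"coauthors": ["B Kim"], "title": "protein folding", "venue": "Bioinformatics"},
    {"coauthors": ["B Kim"], "title": "protein structure", "venue": "Bioinformatics"},
    {"coauthors": ["B Kim"], "title": "gene protein", "venue": "Bioinformatics"},
]
# The initial guess deliberately puts the fourth citation in the wrong cluster.
labels = model_based_kmeans(toy, k=2, init=[0, 0, 0, 0, 1, 1])
print(labels)
```

On this toy input the misassigned fourth citation moves to the second cluster after one reassignment pass, because its coauthor, title words and venue are all more probable under that cluster's model. Choosing the number of clusters and a canonical name per cluster is left out of the sketch.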
Similar resources
An Improved Name Disambiguation Method Based on Atom Cluster
This paper proposes an improved name disambiguation method based on atom clusters. Because similarity methods built on character-related properties from information extraction depend on the available character information, a new name disambiguation method and an improved k-means algorithm for name disambiguation are proposed. Cluster analysis is introduced to the name disambiguation p...
A hybrid DEA-based K-means and invasive weed optimization for facility location problem
In this paper, instead of the classical approach to the multi-criteria location selection problem, a new approach was presented based on selecting a portfolio of locations. First, the indices affecting the selection of maintenance stations were collected. The K-means model was used for clustering the maintenance stations. The optimal number of clusters was calculated through the Silhou...
Persistent K-Means: Stable Data Clustering Algorithm Based on K-Means Algorithm
Identifying clusters, or clustering, is an important aspect of data analysis. It is the task of grouping a set of objects in such a way that objects in the same group or cluster are more similar to each other in some sense. It is a main task of exploratory data mining and a common technique for statistical data analysis. This paper proposes an improved version of the K-Means algorithm, namely Persistent K...
Clustering web people search results using fuzzy ants
Person name queries often bring up web pages that correspond to individuals sharing the same name. The Web People Search (WePS) task consists of organizing search results for ambiguous person name queries into meaningful clusters, with each cluster referring to one individual. This paper presents a fuzzy ant based clustering approach for this multi-document person name disambiguation problem. T...
Improving the accuracy of author name disambiguation using ensemble clustering
Today, digital libraries are important academic resources that hold millions of citations and essential bibliographic information such as titles, author names and publication locations. From the viewpoint of knowledge accumulation management, the ability to find the desired content quickly and accurately is of great importance. The complexity and similarity in these resources cause many challenges and...
Publication date: 2003